Introduction


So you want to know the range of products being sold by your competitor. You go to their website, see all the products (along with their details) and want to compare them with your own range of products. Great! How do you do that? How do you get the details available on the website into a format in which you can analyse them?

Hmmm.. If you have these or similar questions on your mind, you have come to the right place. In this post, we will learn about web scraping using R. If you like a more structured approach, try our free online course, Web Scraping with R.

The What?


What exactly is web scraping or web mining or web harvesting? It is a technique for extracting data from websites. Remember, websites contain a wealth of useful data but are designed for human consumption, not data analysis. The goal of web scraping is to take advantage of the pattern or structure of web pages to extract and store data in a format suitable for data analysis.

The Why?


Now, let us understand why we may have to scrape data from the web.


The How?


Use Cases


Below are a few use cases of web scraping:

Things to keep in mind…


Case Studies





HTML Basics


To be able to scrape data from websites, we need to understand how the web pages are structured. In this section, we will learn just enough HTML to be able to start scraping data from websites.

HTML, CSS & JAVASCRIPT


A web page typically is made up of the following:

  • HTML (Hyper Text Markup Language) takes care of the content. You need to have a basic knowledge of HTML tags as the content is located with these tags.
  • CSS (Cascading Style Sheets) takes care of the appearance of the content. While you don’t need to look into the CSS of a web page, you should be able to identify the id or class that manages the appearance of content.
  • JS (Javascript) takes care of the behavior of the web page.

HTML Element


An HTML element consists of a start tag and an end tag, with content inserted in between. Elements can be nested, and tag names are case insensitive.

HTML Tags


Below is a list of basic and important HTML tags you should know before you get started with web scraping.

DOM


DOM (Document Object Model) defines the logical structure of a document and the way it is accessed and manipulated. In the above image, you can see that the HTML is structured as a tree and you can trace a path to any node or tag. We will use a similar approach in our case studies.

HTML Attributes


  • all HTML elements can have attributes
  • they provide additional information about an element
  • they are always specified in the start tag
  • usually come in name/value pairs

The class attribute is used to define the same style for elements with the same class name. HTML elements with the same class name will have the same format and style. The id attribute specifies a unique id for an HTML element. It can be used on any HTML element and is case sensitive. The style attribute sets the style of an HTML element.
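To make the above concrete, here is a minimal sketch using xml2 and rvest to parse a small, made-up HTML snippet and read its attributes (the element and attribute values here are invented for illustration):

```r
library(xml2)
library(rvest)

# a tiny, made-up snippet with id, class and style attributes
snippet <- read_html('<p id="intro" class="lead" style="color: blue;">Hello</p>')
node    <- html_node(snippet, "p")

html_attr(node, "id")     # "intro"
html_attr(node, "class")  # "lead"
html_attr(node, "style")  # "color: blue;"
```

Note that read_html() happily parses a raw HTML string as well as a URL, which makes it easy to experiment offline.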




Libraries


We will be using the following R packages in this tutorial.

library(robotstxt)
library(rvest)
library(xml2)
library(dplyr)
library(stringr)
library(forcats)
library(magrittr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tibble)
library(purrr)

Best Selling Mobile Phones


In this first case study, we will scrape the details of best selling smart phones from Amazon. Our goal is to extract the following:

As mentioned earlier, we will first check if we can scrape data from the web page using paths_allowed() from the robotstxt package. We need to specify the url of the web page using the paths argument. If we can access the web page, paths_allowed() will return TRUE, else FALSE.

Since it has returned TRUE, let us go ahead and download the web page using read_html() from the xml2 package and store it in top_phones. We do this to avoid making repeated requests to the website, which may lead to our IP address being blocked.
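The pattern looks roughly like the sketch below. The URL is a placeholder, not the real Amazon page; the network calls are commented out so the example runs offline, and the last line shows that read_html() can also parse raw HTML directly:

```r
library(robotstxt)
library(xml2)

url <- "https://example.com/best-sellers"  # placeholder URL, not the real page

# paths_allowed(paths = url)     # TRUE if robots.txt permits scraping this path
# top_phones <- read_html(url)   # download and parse the page once, then reuse it

# for offline illustration, parse a made-up snippet instead:
top_phones <- read_html("<div class='crwTitle'><a>Acme One (Black, 4GB RAM)</a></div>")
```

Storing the parsed page in top_phones once means every later extraction works on the local copy rather than hitting the site again.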

Brand Name


The first detail we want to extract is the brand name of the phone. If you look at the HTML code, it is nested within a hyperlink, defined by <a>. The link is inside a section identified by the class crwTitle. We will use this information to ask rvest to extract text content which will give us the brand name.

The location is specified using html_nodes() and the text extracted using html_text(). Since crwTitle is a class, we use . before it, but not for a as it is an HTML tag. Both the class and the tag are specified within quotes and separated by a space.
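Here is the selector in action on a made-up snippet that mimics the structure described above (".crwTitle a" means an &lt;a&gt; tag inside an element with class crwTitle):

```r
library(xml2)
library(rvest)

# made-up HTML mimicking the structure of the product listing
top_phones <- read_html("<div class='crwTitle'><a> Acme One (Black, 4GB RAM) </a></div>")

titles <-
  top_phones %>%
  html_nodes(".crwTitle a") %>%  # the <a> inside the crwTitle section
  html_text()                    # extract the text content

titles
```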

If you observe the output, it includes the following:

  • brand name
  • color
  • RAM
  • storage capacity

To extract the brand name, we will use str_split() from stringr and specify the pattern \\( i.e. split the string at the first opening bracket. Since ( is a special character, we use \\ for escaping. Next, we use map_chr() from the purrr package to extract the first element from the resulting list. Finally, we remove the white space using str_trim().
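The same cleanup steps, applied to a single hypothetical raw string, look like this:

```r
library(stringr)
library(purrr)
library(magrittr)

raw <- "Acme One (Black, 4GB RAM, 64GB Storage)"  # hypothetical scraped value

brand <-
  raw %>%
  str_split(pattern = "\\(") %>%  # split at the first opening bracket
  map_chr(1) %>%                  # keep the part before the bracket
  str_trim()                      # drop the trailing white space

brand  # "Acme One"
```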

The whole point of the above exercise is to show that extracting the data using rvest is just one part of web scraping. We need to spend enough time tidying and reshaping the data to get it into a format useful for data analysis.

Color


In the previous step, we observed that the data extracted from top_mobiles included the color of the mobile as well. So the location of the color in the HTML is the same: within the hyperlink of the crwTitle section. But now, we want to extract the color and not the brand name.

We will split the original string at the opening bracket ( and extract the second part, which includes:

  • color
  • RAM
  • storage capacity

The color is separated from the rest by a comma. We will use the , to split the string and extract the color using map_chr() i.e. extract the first element from the resulting list.
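On the same hypothetical raw string as before, the color extraction looks like this:

```r
library(stringr)
library(purrr)
library(magrittr)

raw <- "Acme One (Black, 4GB RAM, 64GB Storage)"  # hypothetical scraped value

color <-
  raw %>%
  str_split(pattern = "\\(") %>%  # split at the opening bracket
  map_chr(2) %>%                  # keep the part after the bracket
  str_split(pattern = ",") %>%    # split the remainder at the commas
  map_chr(1)                      # the first piece is the color

color  # "Black"
```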

Rating


Let us extract the ratings for the phones now. If you look at the HTML code, we can locate rating within the following:

  • <span>
  • <a>
  • .crwProductDetail

It is wrapped within <span> identified by the class .a-icon-alt which is inside a hyperlink in the section identified by the class .crwProductDetail.

In the output, you can observe the text out of 5 stars for each rating. Let us get rid of this text by selecting the first 3 characters using str_sub(). We pick the first 3 characters using the start and end arguments and supply them the values 1 and 3. Finally, we convert the rating to a number using as.numeric().
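On a hypothetical rating string, the cleanup is a one-liner:

```r
library(stringr)

raw    <- "4.3 out of 5 stars"                     # hypothetical rating text
rating <- as.numeric(str_sub(raw, start = 1, end = 3))  # keep "4.3", convert

rating  # 4.3
```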

Number of Reviews


Now that we know the rating for each of the top 10 best selling smart phones, let us find out how many people have reviewed them. The number of reviews is located within the following:

  • hyperlink identified by the class .a-link-normal
  • <span> tag identified by the class .a-size-small
  • section identified by the class .crwProductDetail

We use the above information within html_nodes() to extract the data. Now let us clean it up a bit and convert it into a number instead of leaving it as a character. If you use as.numeric() directly, you will see NA in the result, the reason being the presence of a comma in the number of reviews. First, we need to get rid of the comma, which we will do using str_replace(). We replace the comma with nothing as shown in the code below and then convert it into a number.
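For example, on a made-up review count:

```r
library(stringr)

raw     <- "12,345"  # hypothetical review count as scraped

# replace the comma with nothing, then convert to a number
reviews <- as.numeric(str_replace(raw, pattern = ",", replacement = ""))

reviews  # 12345
```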

Real Price


The price is one of the most important factors when it comes to choosing a smart phone. Let us look at the price of the best selling mobile phones. Again, looking at the HTML code, the price can be located within the following:

  • <span> tag identified by the class .a-text-strike
  • section identified by the class crwPrice and .crwProductDetail

Using the above information, we can extract the price of the mobile phones, which is returned as a character vector, but we need to convert it to numeric if we are to analyze it further. Let us convert the price to a number using the following steps:

  • use str_trim() to remove the white spaces
  • exclude the currency information using str_sub()
  • replace the comma using str_replace()
  • remove the decimal values using str_split()
  • extract the price from the resulting list using map_chr()
  • convert the price to a number using as.numeric()
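The steps above can be sketched as a single pipeline. The raw price string below is made up (an ASCII "Rs." stands in for the currency symbol, so str_sub() drops the first four characters here):

```r
library(stringr)
library(purrr)
library(magrittr)

raw <- "  Rs. 12,999.00  "  # hypothetical raw price string

price <-
  raw %>%
  str_trim() %>%                  # "Rs. 12,999.00"
  str_sub(start = 5) %>%          # drop the currency: "12,999.00"
  str_replace(",", "") %>%        # "12999.00"
  str_split(pattern = "\\.") %>%  # split at the decimal point
  map_chr(1) %>%                  # keep the integer part: "12999"
  as.numeric()                    # convert to a number

price  # 12999
```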

Actual Price


Deep discounts are one of the strategies adopted by ecommerce firms to drive sales. Let us look at the actual price (the price after discount) of the best selling mobile phones. The discounted price can be located within the following:

  • <span> tag identified by the class .crwActualPrice
  • section identified by the class crwPrice and .crwProductDetail

Using the above information, we can extract the discounted price of the mobile phones. Let us convert the price to a number using the same steps as in the case of the real price.




IMDB Top 50


In this case study, we will extract the following details of the top 50 movies from the IMDB website:

Title


As we did in the previous case study, we will look at the HTML code of the IMDB web page and locate the title of the movies in the following way:

  • hyperlink inside <h3> tag
  • section identified with the class .lister-item-content

In other words, the title of the movie is inside a hyperlink (<a>) which is inside a level 3 heading (<h3>) within a section identified by the class .lister-item-content.
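A self-contained illustration of that selector, on a made-up snippet mimicking the structure just described:

```r
library(xml2)
library(rvest)

# made-up HTML mimicking a movie list entry
page <- read_html('
  <div class="lister-item-content">
    <h3><a>Sample Movie</a></h3>
  </div>')

title <-
  page %>%
  html_nodes(".lister-item-content h3 a") %>%  # <a> inside <h3> inside the section
  html_text()

title  # "Sample Movie"
```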

Year of Release


The year in which a movie was released can be located in the following way:

  • <span> tag identified by the class .lister-item-year
  • nested inside a level 3 heading (<h3>)
  • part of section identified by the class .lister-item-content

If you look at the output, the year is enclosed in round brackets and is a character vector. We need to do 2 things now:

  • remove the round bracket
  • convert year to class Date instead of character

We will use str_sub() to extract the year and convert it to Date using as.Date() with the format %Y. Finally, we use year() from the lubridate package to extract the year from the previous step.
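On a hypothetical raw value, those steps look like this:

```r
library(stringr)
library(lubridate)

raw <- "(2019)"  # hypothetical value as scraped

release_year <-
  year(                                       # pull the year back out as a number
    as.Date(
      str_sub(raw, start = 2, end = 5),       # strip the brackets: "2019"
      format = "%Y"                           # parse as a year
    )
  )

release_year  # 2019
```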

Certificate


The certificate given to the movie can be located in the following way:

  • <span> tag identified by the class .certificate
  • nested inside a paragraph (<p>)
  • part of section identified by the class .lister-item-content

Runtime


The runtime of the movie can be located in the following way:

  • <span> tag identified by the class .runtime
  • nested inside a paragraph (<p>)
  • part of section identified by the class .lister-item-content

Genre


The genre of the movie can be located in the following way:

  • <span> tag identified by the class .genre
  • nested inside a paragraph (<p>)
  • part of section identified by the class .lister-item-content

Rating


The rating of the movie can be located in the following way:

  • part of the section identified by the class .ratings-imdb-rating
  • nested within the section identified by the class .ratings-bar

Since rating is returned as a character vector, we will use as.numeric() to convert it into a number.

XPATH


Votes


To extract votes from the web page, we will use a different technique. In this case, we will use xpath and attributes to locate the total number of votes received by the top 50 movies.

xpath is specified using the following:

  • tag
  • attribute name
  • attribute value

In case of votes, they are the following:

  • meta
  • itemprop
  • ratingCount

Next, we are not looking to extract a text value as we did in the previous examples using html_text(). Here, we need to extract the value assigned to the content attribute within the <meta> tag using html_attr().

Finally, we convert the votes to a number using as.numeric().
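Putting the xpath and html_attr() steps together on a made-up snippet (the attribute values mirror the itemprop/ratingCount pattern described above; the count itself is invented):

```r
library(xml2)
library(rvest)

# made-up snippet with the vote count stored in a <meta> tag's content attribute
page <- read_html('<p><meta itemprop="ratingCount" content="1234567"></p>')

votes <-
  page %>%
  html_nodes(xpath = "//meta[@itemprop='ratingCount']") %>%  # locate via xpath
  html_attr("content") %>%                                   # read the attribute
  as.numeric()

votes  # 1234567
```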

Revenue


We wanted to extract both revenue and votes without using xpath but the way in which they are structured in the HTML code forced us to use xpath to extract votes. If you look at the HTML code, both votes and revenue are located inside the same tag with the same attribute name and value i.e. there is no distinct way to identify either of them.

In case of revenue, the xpath details are as follows:

  • <span>
  • name
  • nv

Next, we will use html_text() to extract the revenue.

To extract the revenue as a number, we need to do some string hacking as follows:

  • extract values that begin with $
  • omit missing values
  • convert values to character using as.character()
  • append NA where revenue is missing
  • remove $ and M
  • convert to number using as.numeric()
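A rough sketch of those steps on made-up values (the step that pads NA back in for movies with missing revenue is omitted here for brevity):

```r
library(stringr)

# hypothetical mix of revenue ("$...M") and vote counts from the same tag
raw <- c("$123.45M", "1,234", "$9.99M")

revenue <- raw[str_detect(raw, "^\\$")]          # keep only values starting with $
revenue <- str_replace_all(revenue, "[$M]", "")  # strip the $ and the M
revenue <- as.numeric(revenue)                   # convert to numbers (in millions)

revenue  # 123.45 9.99
```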



Top Websites


In this case study, we will extract the following details of the 50 most visited websites in the world:

Let us look at the code and the output first.

Surprising, right? Whenever data is structured as a table in HTML, you can select it by its table tag in html_nodes(), which will return all the tables in the web page, after which you can use html_table() to extract and convert the whole table to a data.frame in R.
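A minimal, self-contained illustration with a made-up two-row table (the table element is selected by its tag name here):

```r
library(xml2)
library(rvest)

# made-up HTML table standing in for the real page
page <- read_html('
  <table>
    <tr><th>Site</th><th>Category</th></tr>
    <tr><td>example.com</td><td>Search</td></tr>
  </table>')

top_sites <-
  page %>%
  html_node("table") %>%  # select the table by its tag
  html_table()            # convert it to a data.frame

top_sites
```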

Since the names of the columns are very long, we have renamed them to be concise and descriptive.

Now, let us look at the categories to which these top 50 websites belong using count() from dplyr. We will sort the result in descending order using the sort argument and assign it the value TRUE.
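On toy data standing in for the scraped categories, count() with sort = TRUE looks like this:

```r
library(dplyr)

# toy data; the category labels are invented for illustration
sites <- tibble(category = c("Search", "Social", "Search", "News"))

category_counts <- count(sites, category, sort = TRUE)
category_counts  # "Search" appears first with n = 2
```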

Let us club some of them to remove the sub categories.

Let us calculate the % of these categories and plot them.

RBI Governors


In this case study, we are going to extract the list of RBI (Reserve Bank of India) Governors. The author of this blog post comes from an Economics background and as such was interested in knowing the professional background of the Governors prior to their taking charge at India’s central bank.

The data in the Wikipedia page is luckily structured as a table and we can extract it using html_table(). There are 2 tables in the web page and we are interested in the second table. Using extract2() from the magrittr package, we will extract the table containing the details of the Governors.

Let us arrange the data by number of days served. The Term in office column contains this information but it also includes the text days. Let us split this column into two columns, term and days, using separate() from tidyr and then select the columns Officeholder and term and arrange it in descending order using desc().
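A sketch of those steps on toy data mimicking the structure of the Wikipedia table (the names and day counts are invented; the as.numeric() step is added here so the sort is numeric rather than alphabetical):

```r
library(dplyr)
library(tidyr)

# toy data; "Term in office" mixes the number and the word "days"
governors <- tibble(
  Officeholder     = c("A. Example", "B. Example"),
  `Term in office` = c("950 days", "1825 days")
)

served <-
  governors %>%
  separate(`Term in office`, into = c("term", "days"), sep = " ") %>%
  select(Officeholder, term) %>%
  mutate(term = as.numeric(term)) %>%  # make the sort numeric
  arrange(desc(term))

served  # B. Example (1825) first
```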

What we are really interested in is the background of the Governors. Use count() from dplyr to look at the background of the Governors and the respective counts.

Let us club some of the categories into Bureaucrats as they belong to the Indian Administrative/Civil Services. The missing data will be renamed as No Info. The category Career Reserve Bank of India officer is renamed as RBI Officer to make it more concise.
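One way to do this clubbing is with forcats. The background labels below are made-up approximations of the kind of values found in the table:

```r
library(forcats)

# hypothetical background labels, including a missing value
backgrounds <- c("Indian Administrative Service", "Economist",
                 "Career Reserve Bank of India officer", NA)

backgrounds[is.na(backgrounds)] <- "No Info"  # label missing data

cleaned <- fct_recode(factor(backgrounds),
  "Bureaucrats" = "Indian Administrative Service",
  "RBI Officer" = "Career Reserve Bank of India officer"
)

levels(cleaned)
```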

Hmmm.. So there were more bureaucrats than economists.

Summary


To get in depth knowledge of R & data science, you can enroll here for our free online R courses.